We begin with SynMap output based on the Schnable 2011 paper (see my notes for details on parameter and download options).

Expression data was retrieved from NCBI: GSE50191 and was reported in Walley 2016. This set includes 69 experiments (3 biological replicates, 23 tissues) for 62,547 mRNAs aligned to Zea mays B73 RefGen_v2 5a WGS annotation set.

Protein abundance data was retrieved from Walley 2016 supplamental table S2. This dataset contains 3 to 7 biological replicates over 33 tissues. Expression data is based on a subset of this protein data tissues. Values are in dNASF, which is summarized here.

1 Import SynMap and GO Term Data


We parse the SynMap output file with ks/kn values (i.e. a DAGChainer file) to a flat format using a perl script (parseKsKnFile.pl) and import the data into R. Some ks and kn values are empty, which is to be expected. These are ignored for the purposes of finding the median, but included in the results if they are in a block that meets the cutoff criteria. GO terms related to Maize gene models are imported as well.

Lets take a look at what we just imported.


2 Choosing a KS cutoff value


We need to select a cutoff value for ks (synonymous mutation rate) to differentiate between the most recent alpha duplication event and the previous beta duplication. Looking at ks value frequency (left), we should not see any obvious cutoff point. By transforming ks values to block median values, we should see two (or more) peaks (center). In this (center) graph, the first peak represents orthologs and the second represents homeologs from “pregrass tetraploidy”. If the vertical line separates the first two peaks sufficently, it is probably a good enough value. If not, we need to pick a new cutoff value for ks. Viewing the log10 transform of the ks values helps differentiate types of homologs (better version of this graph with color can be generated in CoGe).

You have selected the log10(ks) cutoff value to be 0. Again, this should ideally separate two peaks in the middle graph.

3 Major Syntenic Blocks


Regions of synteny are not confined to a 1:1 relationship between Sorghum and Maize chromosomes. Where multiple regions in Maize are syntenic to the same region on Sorghum, we can assume this is a result of the whole genome duplication even. These are the regions which will be sorted into subgenome 1 and subgenome 2.

Table
Chromosome b34889_1 b34889_2 b34889_3 b34889_4 b34889_5 b34889_6 b34889_7 b34889_8 b34889_9 b34889_10 total
a31607_1 2431 na na na 710 na na na 736 na 3877
a31607_2 na 895 na na na na 1532 na na na 2427
a31607_3 na na 1914 na na na na 1174 na na 3088
a31607_4 na na na 893 1475 na na na na na 2368
a31607_5 na 114 na 296 na na na na na na 410
a31607_6 na 1055 na na na na na na na 747 1802
a31607_7 486 na na 412 na 108 na na na 164 1170
a31607_8 206 na 192 15 na 21 na na na 229 663
a31607_9 na na na na na 765 na 419 na 58 1242
a31607_10 na na na na 202 406 na na 700 na 1308

There are 0 unaligned sorghum chromosome(s) in the above table. If there are any unaligned sorghum chromosomes, consider picking a new ks cutoff.

4 SubGenomes


In order to separate subgenome 1 from subgenome 2, we implement a greedy algorithm to collect non-overlapping (when projected on sorghum) syntenic blocks by size. Psuedocode of the algorithm is as follows: foreach sorghum chromosome, set the largest syntenic chromosome (by gene) as sub1. Foreach additional syntenic chromosome if doesn’t overlap what is already in sub1, add it to sub1. Else, if it does not overlap anything in sub2, add to sub2. Else toss. After all syntenic chromosomes are process for this sorghum chromosome, count the number of genes in sub1 and sub2. If sub2 is bigger, switch them. *Toss out “small” syntenic chromosomes (51 genes or less?), and consider a tolerance for how much overlap must occur to consider it a true overlap. Also consider switching to gene level overlap rather than whole-chromosome start/stop values.

In order to see if one subgenome tends to have more isoforms, we plot the density of isoform counts from each subgenome based on isoforms in the v4 release 32 GFF file. We scale the x axis using log10, as we have a very long tail on this data (i.e. most counts are 1, but a few range to nearly 600). The densities track each other very strongly, so there isn’t likely anything to look for here.

5 Compared to Paper

How many chromosomes with synteny to sorghum were correctly placed in subgenome 1 or subgenome 2? Items in “Paper” only are missed subgenome assignments. Items in “Greedy” only are false assignments. Items in center are correct. (currently sorted by hand)

6 GO Terms in Maize


First we show both how many unique GO terms have been assigned using a computational ev-code vs. an experimental ev-code. A few terms have been annotated using both or were assigned multiple times based on different publications, which is why n is greater than the number of unique GO terms in our set. We also break down the GO terms by type. A few terms may not have a type, typically because they have been made obsolete.

Second, we look at the same divisions but this time for all gene annotations (i.e. gene + GO Term + Ev-Code). Again, since some terms are assigned computationally and experimentally to the same gene or assigned multiple times based on different publications, n is greater than the number of unique GO term assignments in out set. The main takeaways here are that computationally assigned GO terms vastly outnumber experimental annotations, most annotations use MF terms, and there are a few terms falling through the cracks most likely due to being outdated.

7 GO Term Comparison by Subgenome


GO Terms by subgenome. If the GO Term is assigned to any gene in the subgenome, it is counted once for that subgenome. This is a presence/absence test.

The main takeaways here are that 1) there are differences in functional assignments between each subgenome, regardless of which filters I apply. 2) the GOSlim for plants is extremely restrictive, and probably not useful. 3) Sub1 has more functional assignments than sub2, although this may be a direct correlation to the larger size (by definition) of sub1.


GO Term assignments to genes by subgenome, which takes into account the gene-GO Term pairings and considers homeologs equivalent for the purposes of determining overlap in assignments.

The main takeaways here mirror the assessment from the GO Term presence/absence check above. There are many differences in assignments, although we can assume that the same terms are reused very often based on the differenced between unique terms and gene annotations.

8 Expression


Cases where the maize1 homolog is more or less expressed than the maize2 homolog

Overall expression patterns between subgenome 1 and subgenome 2 are highly similar. Below we show to views of the same data. In the density graph, we can see that subgenome 1 is slightly shifted to the right of subgenome 2 expression, but otherwise follows the same shape. Viewed as a boxplot, you can again see that the quartiles in subgenome 1 and 2 are largely the same. This also shows a few high expressing outliers on subgenome 1.

Paired t-test suggests that there is no difference in means between FPKM of homeologs in sub1 vs. sub2 (p-val > 0.05, so we accept null hypothesis of equal means):

## 
##  Paired t-test
## 
## data:  expressedPairs$FPKM_mean1 and expressedPairs$FPKM_mean2
## t = 1.769, df = 2853, p-value = 0.077
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3611981  7.0253531
## sample estimates:
## mean of the differences 
##                3.332077

I am curious if our definition of sub 1 and sub 2 (i.e. the larger syntenic block is defined as sub 1) is really the best way to decide which genes are in sub 1 and sub 2. It seems to me that if one could resonably expect that, given 2 identically duplicate genes, if there was no benefit to having duplicated function then it would be a matter of random chance which of the two duplicate genes would loose function over time. It also seems to me that the quickest way to “degrade” an unnecessary gene (evolutionarily speaking) is to reduce expression. So what happens if we pick the higher expressing gene to be in sub 1 and the lower expressing gene to be in sub 2? Not surprising, we can find differences in populations between high and low expressing pairs. I wonder if this holds across different maize lines?

9 Compare Homeolog Pairs


In all cases where there is a homeolog on both maize1 and maize2, the we want to look to several measures to see if the genes on one subgenome tend to be present in greater abundance compared to the other. We can define this comparison as either greater expression, greater protein abundance, or a greater number of isoforms available to the gene. We can compare using a factor to define a cutoff for “greater”. (i.e. 4-fold higher expression/abundance etc.).

First we look at expression values. We find that maize1 consistently overexpresses the maize2 pair on average across all available data, regardless of factor (1,2, or 4), and regardless of whether or not we allow NA values to be treated as ‘0’.


This same pattern holds for protein abundance data as well. Here we look at protein abundance data for the tissues that also have expression data.

We can see this same trend for all experiments across all factors (fold-difference of 1 to 9 shown). The blue line (neither gene has greater expression) grows as factor increases since we have a stricter and stricter definition of what counts as greater expression. Red (maize1) is always above green (maize2) indicating that more maize1 homeologs have greater expression over their maize2 pair than vice versa for all expriments and all factors. Red and green lines both tend to zero when factor is large enough. Protein data (not shown) follows this same pattern.


The number of isoforms each gene of the homeolog pairs is thought to express follows a pattern where maize1 tends to have more isoforms than maize2 homeologs.

Looking at a few individual cases (* these are calculated as maize1 > maize2, since maize2 > maize1 makes a negative fold change):

Top ten differentially expressed gene pairs (in log2)
Maize1_url Maize2_url Sample Value_maize1 Value_maize2 foldChange
Zm00001d039733 Zm00001d008668 Leaf Zone 2 (Stomatal) 4084.690 1.582643 11.333675
Zm00001d039733 Zm00001d008668 Ear Primordium 2-4 mm 3159.010 1.392160 11.147932
Zm00001d039733 Zm00001d008668 Leaf Zone 3 (Growth) 2335.750 1.137135 11.004267
Zm00001d039733 Zm00001d008668 Ear Primordium 6-8 mm 2839.653 1.534157 10.854053
Zm00001d039733 Zm00001d008668 Embryo 20 DAP 1904.780 1.039390 10.839672
Zm00001d044122 Zm00001d011438 Silk 2306.653 1.317030 10.774297
Zm00001d039733 Zm00001d008668 7-8 internode 1749.163 1.225893 10.478616
Zm00001d039733 Zm00001d008668 Leaf Zone 1 (Symmetrical) 2157.867 1.734310 10.281028
Zm00001d045431 Zm00001d036086 Root - Cortex 5 Days 1173.249 1.197500 9.936267
Zm00001d039733 Zm00001d008668 6-7 internode 1390.567 1.581110 9.780520
Top ten differentially abundant protein pairs (in log2)
Maize1_url Maize2_url Sample Value_maize1 Value_maize2 foldChange
Zm00001d045431 Zm00001d036086 Silk 56580.732 35.635414 10.632783
Zm00001d045431 Zm00001d036086 Endosperm Crown 27 DAP 71582.340 94.567236 9.564047
Zm00001d032748 Zm00001d014063 Leaf Zone 2 (Stomatal) 1691.444 2.484244 9.411233
Zm00001d045431 Zm00001d036086 7-8 internode 22308.627 49.224767 8.824002
Zm00001d045431 Zm00001d036086 Secondary Root 7-8 Days 46608.118 161.463500 8.173229
Zm00001d033850 Zm00001d013367 Ear Primordium 6-8 mm 99061.113 356.606790 8.117841
Zm00001d045431 Zm00001d036086 Primary Root 5 Days 47139.332 173.503759 8.085821
Zm00001d032748 Zm00001d014063 Leaf Zone 3 (Growth) 1399.480 5.360648 8.028267
Zm00001d039865 Zm00001d008619 Embryo 38 DAP 23635.479 100.652210 7.875431
Zm00001d021702 Zm00001d006539 Pericarp/Aleurone 27 DAP 7784.268 33.973148 7.840023
Gene pairs in the top 50 for both differential gene expression and differential protein abundance (in log2).
Maize1_url Maize2_url Sample foldChange_expr foldChange_prot
Zm00001d045431 Zm00001d036086 Root - Cortex 5 Days 9.936267 7.539404
Zm00001d045431 Zm00001d036086 Root - Meristem Zone 5 Days 9.724529 7.200059
Zm00001d045431 Zm00001d036086 Root - Elongation Zone 5 Days 9.563307 6.646152
Zm00001d045431 Zm00001d036086 6-7 internode 8.231741 7.013317
Zm00001d045431 Zm00001d036086 7-8 internode 7.930961 8.824002

Look at the top 10 pairs with highest isoform count diffs (foldChange in log2)

Top 10 pairs with highest isoform count diffs (foldChange in log2)
Maize1_url Maize2_url Value_maize1 Value_maize2 foldChange
Zm00001d044186 Zm00001d011386 315 5 5.977280
Zm00001d040748 Zm00001d009220 92 2 5.523562
Zm00001d028579 Zm00001d047831 34 2 4.087463
Zm00001d043900 Zm00001d011572 27 2 3.754888
Zm00001d040279 Zm00001d008338 23 2 3.523562
Zm00001d038190 Zm00001d010282 30 3 3.321928
Zm00001d027848 Zm00001d048318 18 2 3.169925
Zm00001d019087 Zm00001d007770 18 2 3.169925
Zm00001d040193 Zm00001d008398 86 10 3.104337
Zm00001d017798 Zm00001d051590 46 6 2.938600

Here is a scatter plot of isoform counts across gene pairs.

10 Expression and Abundance are Weakly Correlated

Fig 10.1:

Fig 10.1:

Fig 10.2: Considering only homeologous pairs where both genes were measured for expression and protein abundance, we look at fold change from maize2 expression to maize1 expression vs. the fold change from maize2 abundance vs. maize2 abundance. A slightly positive trend indicates a mild positive correlation between an increase in expression and an increase in abundance.

Fig 10.2: Considering only homeologous pairs where both genes were measured for expression and protein abundance, we look at fold change from maize2 expression to maize1 expression vs. the fold change from maize2 abundance vs. maize2 abundance. A slightly positive trend indicates a mild positive correlation between an increase in expression and an increase in abundance.

11 Data Exploration




Red Square = Zm00001d020592 Pericarp/Aleurone 27 DAP vs. Zm00001d005793

Yellow Square = Zm00001d003188 Pericarp/Aleurone 27 DAP vs. Zm00001d025753

Yellow X = Zm00001d039040 Mature Leaf 8

Pink Cross = Zm00001d015385 Mature Leaf 8

12 Wrapup


Report generated on: October 20, 2017